1 Introduction

These exercises are meant to show how to conceptually approach your data analysis but there are many more and different ways to explore your data. The most important thing to keep in mind is that you have to understand your own data and analyses. One way to achieve this is to perform visual explorations that help you to judge whether the data are appropriate for your question.

Now let’s start the fun!

In R studio.

2 Load libraries - working environnement

2.1 Packages for data manipulation and wrangling

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2.2 Packages for microbiome analysis

## Loading required package: phyloseq
## 
## microbiome R package (microbiome.github.com)
##     
## 
## 
##  Copyright (C) 2011-2022 Leo Lahti, 
##     Sudarshan Shetty et al. <microbiome.github.io>
## 
## Attaching package: 'microbiome'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## The following object is masked from 'package:base':
## 
##     transform
## 
## Attaching package: 'microbiomeutilities'
## The following object is masked from 'package:microbiome':
## 
##     add_refseq

2.3 Packages for data visualization

3 Load the data

First we need to load our data. Usually the biggest bottleneck between raw data and analyses is to get the data in the right shape for your purpose. Often this requires a little bit of data mingling. On this road - Google is your best friend to master the R universe :)

Let’s first load the relative abundance table.

For this part we are using the phyloseq package. The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs), especially when there is associated sample data, phylogenetic tree, and/or taxonomic assignment of the OTUs.

First st your working directory: ::: textbox

setwd("../Day_2/")

:::

And load your two files: ::: textbox

merged_biom_samples <- phyloseq::import_biom("../data/fastq/kraken/bracken/merge_species.biom") 
merged_metagenomes <- merged_biom_samples
meta <- readxl::read_excel("../data/Glossina_metadata.xlsx")

:::

3.1 Phyloseq-ize Data

Any data already in an R session can be annotated/coerced to be recognized by phyloseq’s functions and methods. This is important, because there are lots of ways you might receive data related to a microbiome project, and not all of these will come from a popular server or workflow that is already supported by a phyloseq import function.

  • otu_table - Works on any numeric matrix. You must also specify if the species are rows or columns
  • tax_table - Works on any character matrix. The row names must match the OTU names (taxa_names) of the otu_table if you plan to combine it with a phyloseq-object.
  • sample_data - Works on any data.frame. The row names must match the sample names in the otu_table if you plan to combine them as a phyloseq-object

Let’s look at the phlyseq object.

merged_biom_samples
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 91 taxa and 78 samples ]
## sample_data() Sample Data:       [ 78 samples by 1 sample variables ]
## tax_table()   Taxonomy Table:    [ 91 taxa by 7 taxonomic ranks ]

3.2 Summarize the data.

# Summarize the phyloseq object 'merged_biom_samples'
#microbiome::summarize_phyloseq(merged_biom_samples)

# Display the first few rows of the OTU (Operational Taxonomic Unit) table
head(otu_table(merged_biom_samples))
## OTU Table:          [6 taxa and 78 samples]
##                      taxa are rows
##      GI-103_S80.kraken_report_bracken_genuses
## 1688                                    12358
## 1686                                      133
## 1696                                       94
## 1762                                      359
## 1766                                       80
## 1764                                       55
##      GI-59_S79.kraken_report_bracken_genuses
## 1688                                    2367
## 1686                                      72
## 1696                                     214
## 1762                                     200
## 1766                                       0
## 1764                                       0
##      Gl-104_S42.kraken_report_bracken_genuses
## 1688                                     6272
## 1686                                       46
## 1696                                       59
## 1762                                      186
## 1766                                       38
## 1764                                       26
##      Gl-111_S43.kraken_report_bracken_genuses
## 1688                                      805
## 1686                                        0
## 1696                                        0
## 1762                                       38
## 1766                                        0
## 1764                                        0
##      Gl-121_S45.kraken_report_bracken_genuses
## 1688                                      119
## 1686                                        0
## 1696                                        0
## 1762                                        0
## 1766                                        0
## 1764                                        0
##      Gl-124_S46.kraken_report_bracken_genuses
## 1688                                       48
## 1686                                        0
## 1696                                        0
## 1762                                        0
## 1766                                        0
## 1764                                        0
##      Gl-125_S47.kraken_report_bracken_genuses
## 1688                                     1170
## 1686                                       69
## 1696                                      134
## 1762                                      125
## 1766                                        0
## 1764                                        0
##      Gl-14_S2.kraken_report_bracken_genuses
## 1688                                    416
## 1686                                      0
## 1696                                      0
## 1762                                     46
## 1766                                      0
## 1764                                      0
##      Gl-15_S3.kraken_report_bracken_genuses
## 1688                                    121
## 1686                                      0
## 1696                                      0
## 1762                                     53
## 1766                                      0
## 1764                                      0
##      Gl-16_S4.kraken_report_bracken_genuses
## 1688                                    298
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      Gl-17_S5.kraken_report_bracken_genuses
## 1688                                     55
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      Gl-25_S48.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-26_S6.kraken_report_bracken_genuses
## 1688                                      0
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      Gl-27_S7.kraken_report_bracken_genuses
## 1688                                    177
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      Gl-28_S8.kraken_report_bracken_genuses
## 1688                                      0
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      Gl-29_S9.kraken_report_bracken_genuses
## 1688                                      0
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      Gl-30_S10.kraken_report_bracken_genuses
## 1688                                      59
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-31_S11.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-32_S12.kraken_report_bracken_genuses
## 1688                                      51
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-33_S13.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-34_S14.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-36_S15.kraken_report_bracken_genuses
## 1688                                     163
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-38_S16.kraken_report_bracken_genuses
## 1688                                    4437
## 1686                                      53
## 1696                                      41
## 1762                                     124
## 1766                                       0
## 1764                                       0
##      Gl-39_S17.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-45_S18.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-47_S19.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-51_S20.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-52_S21.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-53_S22.kraken_report_bracken_genuses
## 1688                                     167
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-54_S23.kraken_report_bracken_genuses
## 1688                                     452
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-56_S24.kraken_report_bracken_genuses
## 1688                                    7069
## 1686                                      80
## 1696                                      68
## 1762                                     186
## 1766                                      52
## 1764                                      54
##      Gl-61_S25.kraken_report_bracken_genuses
## 1688                                      99
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-62_S26.kraken_report_bracken_genuses
## 1688                                    6860
## 1686                                     217
## 1696                                     135
## 1762                                      74
## 1766                                       0
## 1764                                       0
##      Gl-63_S27.kraken_report_bracken_genuses
## 1688                                     207
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-64_S28.kraken_report_bracken_genuses
## 1688                                     112
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-66_S29.kraken_report_bracken_genuses
## 1688                                    3538
## 1686                                     196
## 1696                                     647
## 1762                                    1205
## 1766                                       0
## 1764                                     146
##      Gl-67_S30.kraken_report_bracken_genuses
## 1688                                     819
## 1686                                      22
## 1696                                      36
## 1762                                     252
## 1766                                       0
## 1764                                       0
##      Gl-69_S31.kraken_report_bracken_genuses
## 1688                                    5923
## 1686                                      61
## 1696                                      37
## 1762                                     145
## 1766                                      29
## 1764                                       0
##      Gl-70_S32.kraken_report_bracken_genuses
## 1688                                     970
## 1686                                       0
## 1696                                      42
## 1762                                     102
## 1766                                       0
## 1764                                       0
##      Gl-71_S33.kraken_report_bracken_genuses
## 1688                                    3971
## 1686                                    2663
## 1696                                     214
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-72_S34.kraken_report_bracken_genuses
## 1688                                      86
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-73_S35.kraken_report_bracken_genuses
## 1688                                     112
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-75_S36.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-79_S37.kraken_report_bracken_genuses
## 1688                                     632
## 1686                                      39
## 1696                                      70
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-7_S1.kraken_report_bracken_genuses
## 1688                                    79
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      Gl-80_S38.kraken_report_bracken_genuses
## 1688                                       0
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-85_S39.kraken_report_bracken_genuses
## 1688                                      87
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-86_S40.kraken_report_bracken_genuses
## 1688                                     165
## 1686                                       0
## 1696                                       0
## 1762                                       0
## 1766                                       0
## 1764                                       0
##      Gl-88_S41.kraken_report_bracken_genuses
## 1688                                    8020
## 1686                                     333
## 1696                                     303
## 1762                                      88
## 1766                                       0
## 1764                                      73
##      I10_S77.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I111_S52.kraken_report_bracken_genuses
## 1688                                      0
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      I113_S59.kraken_report_bracken_genuses
## 1688                                      0
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      I117_S53.kraken_report_bracken_genuses
## 1688                                      0
## 1686                                      0
## 1696                                      0
## 1762                                      0
## 1766                                      0
## 1764                                      0
##      I11_S78.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I13_S67.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I14_S68.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I15_S69.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I16_S70.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I17_S71.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I19_S73.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I24_S49.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I28_S74.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I29_S75.kraken_report_bracken_genuses I2_S60.kraken_report_bracken_genuses
## 1688                                     0                                    0
## 1686                                     0                                    0
## 1696                                     0                                    0
## 1762                                     0                                    0
## 1766                                     0                                    0
## 1764                                     0                                    0
##      I35_S50.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I36_S51.kraken_report_bracken_genuses I3_S61.kraken_report_bracken_genuses
## 1688                                     0                                    0
## 1686                                     0                                    0
## 1696                                     0                                    0
## 1762                                     0                                    0
## 1766                                     0                                    0
## 1764                                     0                                    0
##      I4_S62.kraken_report_bracken_genuses I50_S54.kraken_report_bracken_genuses
## 1688                                    0                                     0
## 1686                                    0                                     0
## 1696                                    0                                     0
## 1762                                    0                                     0
## 1766                                    0                                     0
## 1764                                    0                                     0
##      I51_S55.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I52_S56.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I56_S57.kraken_report_bracken_genuses
## 1688                                     0
## 1686                                     0
## 1696                                     0
## 1762                                     0
## 1766                                     0
## 1764                                     0
##      I57_S58.kraken_report_bracken_genuses I5_S63.kraken_report_bracken_genuses
## 1688                                     0                                    0
## 1686                                     0                                    0
## 1696                                     0                                    0
## 1762                                     0                                    0
## 1766                                     0                                    0
## 1764                                     0                                    0
##      I6_S64.kraken_report_bracken_genuses I7_S65.kraken_report_bracken_genuses
## 1688                                   56                                    0
## 1686                                    0                                    0
## 1696                                    0                                    0
## 1762                                    0                                    0
## 1766                                    0                                    0
## 1764                                    0                                    0
##      I8_S66.kraken_report_bracken_genuses I9_S76.kraken_report_bracken_genuses
## 1688                                    0                                    0
## 1686                                    0                                    0
## 1696                                    0                                    0
## 1762                                    0                                    0
## 1766                                    0                                    0
## 1764                                    0                                    0
# Display the first few rows of the taxonomy table
head(tax_table(merged_biom_samples))
## Taxonomy Table:     [6 taxa by 7 taxonomic ranks]:
##      Rank1         Rank2           Rank3        Rank4          
## 1688 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1686 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1696 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1762 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1766 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
## 1764 "k__Bacteria" "p__Firmicutes" "c__Bacilli" "o__Bacillales"
##      Rank5               Rank6                 Rank7
## 1688 "f__Bacillaceae"    "g__Bacillus"         "s__"
## 1686 "f__Bacillaceae"    "g__Anoxybacillus"    "s__"
## 1696 "f__Bacillaceae"    "g__Geobacillus"      "s__"
## 1762 "f__Planococcaceae" "g__Lysinibacillus"   "s__"
## 1766 "f__Planococcaceae" "g__Rummeliibacillus" "s__"
## 1764 "f__Planococcaceae" "g__Planococcus"      "s__"
# Display the first few rows of the sample data associated with 'merged_biom_samples'
#head(sample_data(merged_biom_samples))

# Get the sample variables of the phyloseq object 'merged_biom_samples'
#sample_variables(merged_biom_samples)

We need to add the metadata to the phyloseq object.

3.3 Format the data

# Set the new column names in the phyloseq object
sample_names(merged_metagenomes) <- gsub("_.*$|\\.kraken_report_bracken_genuses$", "", sample_names(merged_metagenomes))
sample_names(merged_metagenomes) <- gsub("^Gl", "GI", sample_names(merged_metagenomes))
# Sort the 'meta' data frame by the 'SRA.identifier' column
meta$Sample <- gsub("_.*$", "", meta$Sample)
meta$Samples <- meta$Sample
meta <- meta %>% column_to_rownames(var = "Samples")

# meta$Sample == sample_names(merged_metagenomes)

# Associate the sorted metadata to the phyloseq object as sample data
merged_metagenomes@sam_data <- sample_data(meta)

# Remove the unnecessary 'k_' prefix in the taxonomy data
merged_metagenomes@tax_table@.Data <- substring(merged_metagenomes@tax_table@.Data, 4)

# Rename the columns of the taxonomy table to represent taxonomic ranks
colnames(merged_metagenomes@tax_table@.Data) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

3.4 Basic stats

Before we start anything, let’s just check out or data a little bit (sanity check). Never go blind into your analyses.

head(psmelt(merged_metagenomes))
## Warning in psmelt(merged_metagenomes): The sample variables: 
## Sample
##  have been renamed to: 
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
##        OTU Sample Abundance sample_Sample                          spp sex
## 6382 46521 GI-125     45702        GI-125 Glossina palpalis gambiensis   F
## 6921 49957   I117     43682          I117 Glossina palpalis gambiensis   F
## 6937 49957   I113     37452          I113 Glossina palpalis gambiensis   F
## 6384 46521    I13     32991           I13 Glossina palpalis gambiensis   M
## 6319 46521     I8     30915            I8 Glossina palpalis gambiensis   F
## 6336 46521     I6     28814            I6 Glossina palpalis gambiensis   F
##      Ecological.gradient             Caracteristic.Gradient    infection
## 6382         farmed land rice, palm trees at mangrove perif     infected
## 6921                wild                           mangrove     infected
## 6937                wild                           mangrove     infected
## 6384                wild                           mangrove not infected
## 6319                wild                           mangrove not infected
## 6336                wild                           mangrove not infected
##        Kingdom         Phylum               Class            Order
## 6382  Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
## 6921 Eukaryota     Euglenozoa       Kinetoplastea  Trypanosomatida
## 6937 Eukaryota     Euglenozoa       Kinetoplastea  Trypanosomatida
## 6384  Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
## 6319  Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
## 6336  Bacteria Proteobacteria Gammaproteobacteria Enterobacterales
##              Family          Genus Species
## 6382 Morganellaceae Wigglesworthia        
## 6921                   Trypanosoma        
## 6937                   Trypanosoma        
## 6384 Morganellaceae Wigglesworthia        
## 6319 Morganellaceae Wigglesworthia        
## 6336 Morganellaceae Wigglesworthia

Q: How many species and samples are detected in our data set?

HINT

ntaxa(merged_metagenomes), nsamples.

Q: Look at the taxonomy is there some problem with the taxonomy?

HINT

We have contaminant in the data. Which one? (head(psmelt(merged_metagenomes)))

Q: what are the taxa with the more abundant reads?

HINT

head(otu_table(merged_metagenomes))

Extraction of the sample’s names.

sample_names(merged_metagenomes)
##  [1] "GI-103" "GI-59"  "GI-104" "GI-111" "GI-121" "GI-124" "GI-125" "GI-14" 
##  [9] "GI-15"  "GI-16"  "GI-17"  "GI-25"  "GI-26"  "GI-27"  "GI-28"  "GI-29" 
## [17] "GI-30"  "GI-31"  "GI-32"  "GI-33"  "GI-34"  "GI-36"  "GI-38"  "GI-39" 
## [25] "GI-45"  "GI-47"  "GI-51"  "GI-52"  "GI-53"  "GI-54"  "GI-56"  "GI-61" 
## [33] "GI-62"  "GI-63"  "GI-64"  "GI-66"  "GI-67"  "GI-69"  "GI-70"  "GI-71" 
## [41] "GI-72"  "GI-73"  "GI-75"  "GI-79"  "GI-7"   "GI-80"  "GI-85"  "GI-86" 
## [49] "GI-88"  "I10"    "I111"   "I113"   "I117"   "I11"    "I13"    "I14"   
## [57] "I15"    "I16"    "I17"    "I19"    "I24"    "I28"    "I29"    "I2"    
## [65] "I35"    "I36"    "I3"     "I4"     "I50"    "I51"    "I52"    "I56"   
## [73] "I57"    "I5"     "I6"     "I7"     "I8"     "I9"

3.5 Aggregation

Microbial species can be called at multiple taxonomic resolutions. We can easily agglomerate the data based on taxonomic ranks. Here, we agglomerate the data at Genus level.

level <- "Genus"
merged_metagenomes_taxfull <- merged_metagenomes
# Aggregate rare taxa at the family level for the phyloseq object 'merged_metagenomes'
merged_metagenomes <- aggregate_rare(merged_metagenomes, level = level, detection = 0/100, prevalence = 0/100)

tax_table_df <- as.data.frame(merged_metagenomes@tax_table)
otus_to_keep <- rownames(tax_table_df[tax_table_df$Genus != "Other", ])
merged_metagenomes <- prune_taxa(otus_to_keep, merged_metagenomes)

# Display the dimensionality of the abundances of 'merged_metagenomes_family'
dim(abundances(merged_metagenomes))
## [1] 89 78

Q: How many sample and tax do we have now?

HINT

There are XX taxa and 80 samples, meaning that there are XX different Phylum level taxonomic groups. Looking at the rowData after agglomeration shows all Enterococcaceae are combined together, and all lower rank information is lost.

head(tax_table(merged_metagenomes))
## Taxonomy Table:     [6 taxa by 2 taxonomic ranks]:
##                                                    Genus                                               
## Acinetobacter                                      "Acinetobacter"                                     
## Agarivorans                                        "Agarivorans"                                       
## Agromyces                                          "Agromyces"                                         
## Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium "Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium"
## Altererythrobacter                                 "Altererythrobacter"                                
## Anoxybacillus                                      "Anoxybacillus"                                     
##                                                    unique                                              
## Acinetobacter                                      "Acinetobacter"                                     
## Agarivorans                                        "Agarivorans"                                       
## Agromyces                                          "Agromyces"                                         
## Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium "Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium"
## Altererythrobacter                                 "Altererythrobacter"                                
## Anoxybacillus                                      "Anoxybacillus"

4 QC and Pre-process data

Now that we know a little bit about our data we can start the pre-processing.

4.1 Library size / read count

Let us check for distribution of number of sequences retained from the Kraken/Bracken approach.

plot_read_distribution(merged_metagenomes, "sex", plot.type = "density")

You can try to plot with different metadata.

Plotting the read count per sample

df <- psmelt(merged_metagenomes)  %>%  group_by(Sample, sex) %>%  
  summarise(sum_reads = sum(Abundance)) %>% arrange(sum_reads) 
## Warning in psmelt(merged_metagenomes): The sample variables: 
## Sample
##  have been renamed to: 
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
## `summarise()` has grouped output by 'Sample'. You can override using the
## `.groups` argument.
ggplot(df) +
  geom_bar(aes(reorder(Sample, -sum_reads), sum_reads, fill=sex),
           col="red", alpha = .7, stat="identity") 

Q: What do observe, what will be the consequences

4.2 Contaminant and filtering targeted data

Samples might be contaminated with exogenous sequences. We have observed 1 contaminants Homo sapiens.

#check 

head(psmelt(subset_taxa(merged_metagenomes, Genus == "Trypanosoma")))
## Warning in psmelt(subset_taxa(merged_metagenomes, Genus == "Trypanosoma")): The sample variables: 
## Sample
##  have been renamed to: 
## sample_Sample
## to avoid conflicts with special phyloseq plot attribute names.
##            OTU Sample Abundance sample_Sample                          spp sex
## 54 Trypanosoma   I117     43682          I117 Glossina palpalis gambiensis   F
## 53 Trypanosoma   I113     37452          I113 Glossina palpalis gambiensis   F
## 31 Trypanosoma  GI-59     27861         GI-59 Glossina palpalis gambiensis   F
## 47 Trypanosoma  GI-85     23509         GI-85 Glossina palpalis gambiensis   F
## 44 Trypanosoma  GI-75     15734         GI-75 Glossina palpalis gambiensis   F
## 66 Trypanosoma    I35     13337           I35 Glossina palpalis gambiensis   F
##    Ecological.gradient             Caracteristic.Gradient infection       Genus
## 54                wild                           mangrove  infected Trypanosoma
## 53                wild                           mangrove  infected Trypanosoma
## 31             ecotone                 mangrove periphery  infected Trypanosoma
## 47             ecotone                 mangrove periphery  infected Trypanosoma
## 44         farmed land rice, palm trees at mangrove perif  infected Trypanosoma
## 66                wild                           mangrove  infected Trypanosoma
##         unique
## 54 Trypanosoma
## 53 Trypanosoma
## 31 Trypanosoma
## 47 Trypanosoma
## 44 Trypanosoma
## 66 Trypanosoma
#Keep only the kingdom of interest
merged_metagenomes <- subset_taxa(merged_metagenomes_taxfull, Kingdom =="Bacteria")
merged_metagenomes <- aggregate_rare(merged_metagenomes, level = level, detection = 0/100, prevalence = 0/100)

tax_table_df <- as.data.frame(merged_metagenomes@tax_table)
otus_to_keep <- rownames(tax_table_df[tax_table_df$Genus != "Other", ])
merged_metagenomes <- prune_taxa(otus_to_keep, merged_metagenomes)

4.3 Microbiome composition

Microbial abundances are typically ‘compositional’ (relative) in the current microbiome profiling data sets. This is due to technical aspects of the data generation process (see e.g. Gloor et al., 2017).

The next example calculates relative abundances as these are usually easier to interpret than plain counts. For some statistical models we need to transform the data into other formats as explained in above link (and as we will see later).

detection: Detection threshold for absence/presence (percentage reads). prevalence: Prevalence threshold (in [0, 1]) (presence across samples).

pseq <- microbiome::transform(merged_metagenomes, transform = "compositional")

Q: what is the meaning and effect of compositional?

HINT

abundances(pseq)

4.4 Visualization

Bar plot are useful to have a broad look at the data.

pseq <- microbiome::transform(merged_metagenomes, transform = "compositional")

# Create a plot of the taxonomic composition at the level selected
p <- plot_composition(pseq,
                      taxonomic.level = "Genus", # Specify the taxonomic level for the plot
                      sample.sort = "Sample",     # Sort samples by sample identifier
                      x.label = "Sample",         # Label for the x-axis
                      group_by = "sex") +          # Group samples by the 'Time' variable

  # Use a color palette from the Color Brewer for filling the Family groups
  #scale_fill_brewer("Species", palette = "magma-viridis") +

  # Arrange the legend items in a single column
  guides(fill = guide_legend(ncol = 1)) +

  # Convert the y-axis to represent relative abundance as a percentage
  scale_y_percent() +

  # Add labels and a title to the plot
  labs(x = "Samples", y = "Relative abundance (%)",
       title = "Relative abundance data - Full dataset") + 

  # Apply a clean theme with a horizontal grid
  theme_ipsum(grid = "Y") +

  # Customize the theme for x-axis text and legend text
  theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text 90 degrees for readability
        legend.text = element_text(face = "italic"),    # Italicize the legend text
        legend.position = "right")

# Print the plot
print(p)

Q: Does the profiles look similar between samples? Can you spot any trends?

HINT

A: We see that the same one or two OTU dominates all samples but there is some variability between samples. And ?

If you have time you can also visualize the other taxonomic levels (e.g. species) with the same approach. Try to come up with the code yourself.

n <- dim(tax_table(pseq))[1]
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
col_vector <- rep(col_vector, length.out = 84)
# Create a plot of the taxonomic composition at the Genus level
p <- plot_composition(pseq,
                      taxonomic.level = "Phylum", # Specify the taxonomic level for the plot
                      sample.sort = "Sample",    # Sort samples by sample identifier
                      x.label = "Sample",        # Label for the x-axis
                      otu.sort = "abundance",    # Sort OTUs by their abundance
                      group_by = "sex") +       # Group samples by the 'Type' variable

  # Use a color palette from the Color Brewer for filling the Genus groups
  scale_fill_manual(values = col_vector) +

  # Arrange the legend items in a single column
  guides(fill = guide_legend(ncol = 1)) +

  # Convert the y-axis to represent relative abundance as a percentage
  scale_y_percent() +

  # Add labels and a title to the plot
  labs(x = "Samples", y = "Relative abundance (%)",
       title = "Relative abundance data") + 

  # Apply a clean theme with a horizontal grid
  theme_ipsum(grid = "Y") +

  # Customize the theme for x-axis text and legend text
  theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text 90 degrees for readability
        legend.text = element_text(face = "italic"))       # Italicize the legend text

# Print the plot
print(p)

What do you Observe? Is it correct? What action should be taken?

5 Final figures

5.1 Run rarefaction curves

Before normalization by sub-sampling, let’s have a look at rarefaction curves, evaluate your sequencing effort and make decisions. Rarefaction is a statistical technique employed to evaluate species richness based on sampling results. The process involved sub-sampling reads without replacement to a defined sequencing depth, thereby creating a standardized library size across samples. Any sample with a total read count less than the defined sequencing depth used to rarefy will be discarded. What we want to see in those curves is that it reaches a plateau meaning no new OTUs are discovered (going up on the Y axis) as sequencing is deeper (the X axis). If this happens we can be pretty confident that we sequenced all the OTUs that were present in the sample.

install.packages("remotes")
remotes::install_github("gauravsk/ranacapa")
# Define the list of samples to remove
samples_to_remove <- c("GI-103", "GI-59", "GI-104", "GI-111", "GI-121", 
                       "GI-124", "GI-125", "GI-14", "GI-15", "GI-16", 
                       "GI-17", "GI-25", "GI-26", "GI-27", "GI-28", 
                       "GI-29", "GI-30", "GI-31", "GI-32", "GI-33", 
                       "GI-34", "GI-36", "GI-38", "GI-39", "GI-45", "GI-47", "GI-51", "I9", "I7")

# Prune these samples from your phyloseq object
physeq_cleaned <- prune_samples(!(sample_names(merged_metagenomes) %in% samples_to_remove), merged_metagenomes)


ranacapa::ggrare(physeq_cleaned, step = 10, color = "Sample", se = FALSE) +
    geom_vline(xintercept = min(sample_sums(pseq)), color = "gray60")
## rarefying sample GI-52
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 36
## rarefying sample GI-53
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 25
## rarefying sample GI-54
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 39
## rarefying sample GI-56
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 26
## rarefying sample GI-61
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample GI-62
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample GI-63
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 29
## rarefying sample GI-64
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 112
## rarefying sample GI-66
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 18
## rarefying sample GI-67
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 17
## rarefying sample GI-69
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 18
## rarefying sample GI-70
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 30
## rarefying sample GI-71
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 24
## rarefying sample GI-72
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample GI-73
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample GI-75
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample GI-79
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 29
## rarefying sample GI-7
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 29
## rarefying sample GI-80
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 18
## rarefying sample GI-85
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 63
## rarefying sample GI-86
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 26
## rarefying sample GI-88
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 32
## rarefying sample I10
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 45
## rarefying sample I111
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 22
## rarefying sample I113
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 17
## rarefying sample I117
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 22
## rarefying sample I11
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample I13
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample I14
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 24
## rarefying sample I15
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample I16
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 55
## rarefying sample I17
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 59
## rarefying sample I19
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 36
## rarefying sample I24
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample I28
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 26
## rarefying sample I29
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample I2
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 22
## rarefying sample I35
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 24
## rarefying sample I36
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample I3
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 22
## rarefying sample I4
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 19
## rarefying sample I50
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 18
## rarefying sample I51
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 20
## rarefying sample I52
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample I56
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 21
## rarefying sample I57
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 65
## rarefying sample I5
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 20
## rarefying sample I6
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 16
## rarefying sample I8
## Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
## data have counts 1, but smallest count is 17

Q: What do you observe?

5.2 Relative abundance (%) - Reduced dataset

pseq <- microbiome::transform(physeq_cleaned, transform = "compositional")

n <- dim(tax_table(pseq))[1]
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
col_vector <- rep(col_vector, length.out = 84)
# Create a plot of the taxonomic composition at the Genus level
p <- plot_composition(pseq,
                      taxonomic.level = "Phylum", # Specify the taxonomic level for the plot
                      sample.sort = "Sample",    # Sort samples by sample identifier
                      x.label = "Sample",        # Label for the x-axis
                      otu.sort = "abundance",    # Sort OTUs by their abundance
                      group_by = "sex") +       # Group samples by the 'Type' variable

  # Use a color palette from the Color Brewer for filling the Genus groups
  scale_fill_manual(values = col_vector) +

  # Arrange the legend items in a single column
  guides(fill = guide_legend(ncol = 1)) +

  # Convert the y-axis to represent relative abundance as a percentage
  scale_y_percent() +

  # Add labels and a title to the plot
  labs(x = "Samples", y = "Relative abundance (%)",
       title = "Relative abundance - Reduced dataset") + 

  # Apply a clean theme with a horizontal grid
  theme_ipsum(grid = "Y") +
  # Customize the theme for x-axis text and legend text
  theme(axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis text 90 degrees for readability
  legend.text = element_text(face = "italic"))       # Italicize the legend text

# Print the plot
print(p)